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Abstract —The recent development of multi-agent simulations 
brings about a need for population synthesis. It is a task of 
reconstructing the entire popnlation from a sampling survey of 
limited size (1% or so), supplying the initial conditions from 
which simulations begin. This paper presents a new kernel density 
estimator for this task. Our method is an analogue of the classical 
Brelman-Meisel-Purcell estimator, but employs novel techniques 
that harness the huge degree of freedom which is required 
to model high-dimensional nonlinearly correlated datasets: the 
crossover kernel, the fe-nearest neighbor restriction of the kernel 
construction set and the bagging of kernels. The performance 
as a statistical estimator is examined through real and synthetic 
datasets. We provide an “optimization-free” parameter selection 
rule for our method, a theory of how our method works and 
a computational cost analysis. To demonstrate the usefulness as 
a population synthesizer, our method is applied to a household 
synthesis task for an urban micro-simulator. 

1. Introduction 

In recent years, increasing computational power enables 
us to conduct large-scale multi-agent simulations in highly 
public subjects, e.g., urban planning JT], transportation ||3, 
energy management la, disaster prevention H and welfare 
engineering 0. To carry out multi-agent simulations in such 
highly public subjects, we face difficulties in collecting de¬ 
tailed survey data that supply realistic initial conditions to sim¬ 
ulators. For example, modern urban micro-simulators require 
disaggregate socio-demographics of the study area including 
each and every household’s residential place, family structure, 
car ownership and income as well as person’s age, gender, 
job, daily activities, etc. The complete survey of such massive 
and detailed information is usually impracticable for cost and 
privacy reasons. 

If only a sampling survey is available, we have to re¬ 
cover the entire population from the obtained sample. For 
this purpose. Iterative Proportional Fitting (IPF) 0 and its 
extensions such as Iterative Proportional Updating (IPU) Q 
are widely used in the aforementioned areas. However, the IPF- 
like approach that simply weights, or copies, the sample points 
to synthesize the population cannot reproduce the diversity of 
the original population which might be missed through the 
sampling survey. Moreover, since this approach only accepts 
categorical variables, numerical variables such as age and 
income must be roughly discretized (usually into two to hve). 

One promising alternative is the statistical approach that 
hrst estimates the probability distribution of sample points and 
then resamples the population from the estimated distribution, 
though it is not easy to model complicated distributions of data 
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Fig. 1. Age vs. annual income of 1 % Washington state citizens from the 
American Community Survey 2012 0. Note that values at very high ages or 
incomes are “top-coded” in this survey in order to protect the confidentiality 
of survey respondents. 

for multi-agent simulations. Figure [T] shows the scatter plot of 
person’s age and income in a real city, which are commonly 
used attributes of person agents in simulations. We can see that 
even this very simplihed example defies parametric models. 
The distribution has an income-degenerate part for under 15 
non-working ages while high income earners in middle ages 
constitute a one-sided long tail. Furthermore, some artifact 
lines arise due to “top-coding”. Our problem is much harder. 
The data often consist of 5 to 20 variables and these variables 
are nonlinearly correlated all together. Parametric modeling is 
a formidable task even for domain experts. 

The kernel density estimation has been shown to possess 
excellent flexibility to ht various complex distributions in 
low-dimensional spaces 13 . “Fitting” in this method is to 
select the bandwidth parameters of kernel functions so as to 
minimize some goodness-of-fit measure. In high dimensions, 
this optimization-intensive htting strategy becomes ineffective. 
Each sample point requires a different bandwidth matrix for 
representing a local correlation which varies with locations in 
a nonlinear manner. Due to the large number of parameters 
in the model, the htting procedure involves a large-scale 
nonlinear optimization that requires huge computational efforts 
and limits the applicable dimensionality and shape of the data 
distribution. 

Recently, a new kind of kernels for density estimation, the 


























crossover kernel, has been proposed Col, which employs a 
crossover operator in genetic algorithm as a kernel function 
in kernel density estimation. The crossover kernel does not 
require bandwidth parameters to specify the shape of the 
function; instead, it uses a subset of sample points called 
the kernel construction set (KCS). Hence it has a potential 
to obviate costly fitting procedures and to quickly resample 
high-dimensional data. Unfortunately, there are merely two 
studies on how to choose KCSs. One appeared in Sakuma 
& Kobayashi’s original work; they proposed an EM algorithm 
to calculate sample’s weights that describe the probability of 
which sample point belongs to which KCS. Another is Kimura 
& Matsumura’s ini: they pointed out that the above method 
does not fit non-Gaussian-mixture-like distributions no matter 
how their weights are optimized, and proposed optimizing the 
choice of KCSs itself rather than the weights. Both methods 
rely on optimization, and thus involve heavy fitting procedures 
again. 

The aim of this paper is to develop an “optimization- 
free” crossover kernel for fast and accurate resampling by 
introducing the /c-nearest neighbor restriction of KCS choice. 
We will show that this simple idea works surprisingly well 
for our problem and has a certain theoretical background. Our 
contributions are summarized as follows. 

• We proposed the fc-nearest neighbor restriction of 
KCS choice, and showed that our method is not only 
faster, but also more accurate than conventional Gaus¬ 
sian kernels and Kimura-Matsumura crossover kernel 
through complicated datasets of 2 to 17 dimensions. 

• We examined the parameter sensitivity of our method, 
and gave a rule of thumb for choosing the neighbor¬ 
hood size k and the KCS size m without optimization. 

• We found that setting m < k decreases the general¬ 
ization error in experiments, and showed its rationale 
from the viewpoint of bagging IIT^ . 

• We demonstrated that the simulation accuracy of Ur- 
banSim Ul can be enhanced if our method supplies 
initial households instead of IPU. 

II. Problem Definition, Notations and 
Organization of the Paper 


Case II — a biased sample with marginal frequencies: 
In real-world applications, samples available are often biased 
rather than uniform. This case requires extra information about 
what bias is introduced. We assume the population’s binned 
frequencies F = {fi,GN\bGB}is given, but the binning B 
is applied to marginal distributions of variables and incomplete 
to recover the entire joint distribution. Thus, what we have 
to do in this case is to correct the sampling bias with the 
frequencies F and to extract the variable correlation from the 
sample X, combining them to estimate the entire joint density 
/. In Section rvn we will tackle an application of this type and 
will present a bias correction technique. Incorporated with our 
resampling method developed in Section IIVI this technique is 
applied to a task of generating initial population of households 
for an urban land-use micro-simulation. 

III. Conventional Methods 

Now focusing on Case I, this section reviews existing 
kernel density estimators and crossover kernels. 

A. Kernel Density Estimation 

Kernel density estimators can be classified into two ver¬ 
sions US: the sample point estimator and the balloon estima¬ 
tor. For resampling purpose, our estimate must be a density 
and easy to sample. The sample point estimator satisfies both 
requirements if the kernel is a density which is easy to sample, 
while the balloon estimator is usually not a density even when 
the kernel is m. Thus we concentrate on the former of the 
following form: 

1 "■ 

/(y) = - H^) (1) 

n ^' 

i=l 

where AT is a kernel function to be specified and is a 
bandwidth matrix that is a c? x d symmetric positive-definite 
matrix of parameters that we are estimating. In this paper, we 
consider the multivariate Gaussian kernel; 


Kiy\x,,H,)=N{y\x,,H,) = 


1 


• exp 






( 2 ) 


Throughout this paper, the following assumptions are 
made. A sample X = { tci, ..., £ R'* } of size n(£ N) 

is given, which is taken from an unknown population. The 
population is a set of 1{g N) i.i.d. points with an unknown 
continuous density f{x). We are going to estimate /, say /, 
and resample I points from / to recover the population. The 
problem of population synthesis will be discussed under two 
different situations below. 

Case I - an unbiased sample: The sample X is uni¬ 
formly drawn from the population. The goal is thus to estimate 
the density of X which is identical to /. The subsequent three 
sections of this paper discuss this case. Section |III] reviews 
existing kernel density estimators and crossover kernels. Sec¬ 
tion presents our proposal and its properties. Section 
shows the effectiveness of proposed method by comparing 
to conventional kernel density estimators on some benchmark 
datasets. Mechanisms behind the method are discussed. 


As a measure of goodness-of-fit, the mean integrated 
squared error (MISE) is most commonly usecQ and its 
bias/variance decomposition is given by 


MISE = E 


J (/(y) - /(y)) dy 

[ (E[/(y)] - /(y)) dy 


squared bias 


(3) 


E 


(/(y)-E[/(y)])‘ 


dy. 


variance 


^Various measures based on the Li-errors, the Kullback-Leibler divergence 
and the Hellinger distance are used for this puipose E] In any case, the 
problem that we will point out in this section still arises. 











In the case of fixed matrix bandwidths {H i = • • • = Hn), 
bandwidth selectors using cross-validation m and plug¬ 
in El have been developed. However, when it comes to 
variable bandwidths, existing techniques are very restrictive. 
The Breiman-Meisel-Purcell (BMP) estimator ifTSl is the most 
classical one, which uses scalar bandwidths, i.e.. Hi = h^I, 
and determines hi as a constant multiple of the Euclidean 
distance 5ik from Xi to the fcth nearest other sample point: 

hi = h5ik for fixed h > 0. (4) 

Asymptotically, this is equivalent to choosing hi oc f{xi)~^/'^ 
as n —^ c» m. Abramson El proposed the square root law, 
i.e., choosing hi oc f{xi)~^^‘^ regardless of d, and showed 
that it reduces bias and accelerates the rate of convergence. To 
handle matrix bandwidths. Sain ||9l studied a binning technique 
which decreases the number of parameters by sharing the same 
bandwidth matrix with sample points in the same bin. His 
simulation results clarified that even for a low-dimensional 
(d = 2) moderate sample size (n = 320), the unbiased cross- 
validation tends to yield too small, namely, overfit bandwidth 
matrices. Since the (unbinned) variable bandwidth matrix has 
nd{d+ l)/2 parameters, the situation is much worse in higher 
dimensions. Developing bandwidth selectors for this case is 
currently an open question. 

Another problem in high dimensions is that the relative 
contribution of variance dominates bias in MISE. Sain m 
pointed out that for the fixed scalar bandwidth, the ratio of 
order of bias to order of variance is 4 : d. Therefore, as 
dimensionality increases, bias reduction techniques get less 
importance and variance reduction techniques including the 
bagging El become the core of the estimator. 

B. Crossover Kernels 

In the genetic algorithm literature, Kita et al. Il20l argued 
the desirable behavior of crossover operators and suggested 
a guideline, the preservation of statistics, which tells us that 
the crossover’s parents and children should have the same 
mean and covariance. Specifically, suppose that the parents 
and children are realizations of random variate x and y in 
respectively, and then the following should be satisfied: 

E[tc] = E[y], C[a;] = C[y]. (5) 

Sakuma & Kobayashi El pointed out that the parents 
Xi,..., Xm (realizations of x) implicitly determine the prob¬ 
ability density function K of y and thus K can be considered 
as a data-driven kernel function. In this context, the set of 
parents X' = {xi,, Xm } is called the kernel construction 
set (KCS). If one chooses a crossover so that it satisfies Q and 
K is the Gaussian distribution, then the crossover coincides 
with the maximum likelihood estimator (MLE) of K calculated 
from the KCS. Sakuma & Kobayashi also noticed that UNDX- 
m II 20 I fulhlls these conditions when the KCS size is set to 
be m = d -I- 1. 

The remaining question is how to construct the en¬ 
tire density estimate from crossover kernels. Sakuma & 
Kobayashi El just used the Gaussian mixture model f{y) = 
'^i^iC(iX{y\fii,'Si) as the final estimate, but its parameters 
are optimized by their modified EM algorithm 


in which each Gaussian component at the fth EM step 
X{y\fj,f\ is replaced by the average of crossover kernels 

..., ^ K{y\X^p). (6) 

^ 1^1 

Their KCSs xj^p are chosen at random from X by the 
probability proportional to weights w^*'pi\xj) which are cal¬ 
culated by their E step and describe the possibility that the 
ith component generates the jth sample point. The superiority 
of their EM algorithm over the original one has been shown 
through simulation studies. 

Kimura & Matsumura 01 proposed a more direct way, 
giving the entire estimate by 

1 ^ 

/(y) = ;^E^(y|^')- ( 7 ) 

1=1 

Their KCSs Xi are chosen so as to maximize the log-likelihood 
Sr=r^°S/(®0- starting from a random choice, KCSs are 
randomly remade (one at a time) if it improves the log- 
likelihood. They showed the high flexibility of this approach 
by applying it to a circularly distributed dataset. 

IV. fc-NEAREST Neighbor Crossover Kernel 

As we have seen, the BMP estimator is fast but restricted to 
the scalar bandwidth, whereas the crossover kernel is slow but 
able to handle the covariance matrix. Our idea is to combine 
the strengths of both estimators. 


A. REX Kernel 


As mentioned in Section IIII-BI UNDX-m coincides with 
the Gaussian MLE only when the KCS size is m = d + 1. 
In order to extend this property to arbitrary size, we employ 
REX on for the kernel function. REX generates a resampling 
point y from the KCS X' = {xi,..., x^ } as follows: 

m ^ 

y = /x^,e*--N(0, —) (8) 

where px' = ^ Str E a random number drawn 

from the univariate Gaussian distribution N(0, ^). 

Eollowing Sakuma & Kobayashi ’s El derivation of the 
kernel function of UNDX-m, we get the probability density 
function of y, namely, the REX kernel: 


K{y\X') = n{y\pix',^x') 


1 

V(2^)<idet(Sx^ 


•exp(-i(y-/rx/)^SxHy-Mx')) (9) 

where Sx' = B-x'V ■ Regardless 

of m, thus always K{y\X') is the Gaussian MLE calculated 
from X'. Comparing the REX kernel (|9|l to the Gaussian 
kernel (l2]l, we see that the covariance matrix Sx' in dD is 
corresponding to the bandwidth matrix Hi in (|2]l. Resampling 
from the Gaussian kernel with explicit MLE costs 0{md^+d^) 
timeO the REX kernel with resampler ([8j bypasses the estima- 


^Since the random number generation relies on the Cholesky decomposition 
of Sx'’ it requires O(md^) time to calculate Sx' from X' and 0{d^) time 
the Cholesky decomposition of Sx'- 
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Fig. 2. An example of local correlation. Two typical Si’s of the proposed 
method Go) are shown by solid ellipsoids. The corresponding /li’s of the BMP 
estimator {4) with h = 1 are drawn by dashed circles. The sample points (the 
dots) and the underlying density (the gray scale) are of the ring-like PDF m- 


tion process and enables us to generate a sample point from 
K{y\X') in 0{md) time. 

B. k-Nearest Neighbor Restriction of KCS Choice 

In order to adapt KCSs to data without optimization, we 
choose KCSs from the /c-nearest neighbor (A:-NN) of each 
sample point. This is a simple yet powerful trick to realize 
a crossover kernel analogue of the BMP estimator. It is shown 
in Appendix lAl that the covariance matrix of A:-NN points 
of a sample point Xi is asymptotically of the form: 




I- 


X4 

^ik 


V/OV/^ 


d + 2 (d + 2)2 


P 


■{xi) as n —>• oo. (10) 


For sufficiently large n, we can expect 5ik 1 and thus 
the first 0{5‘^P term in the r.h.s. overwhelms the second 
0{5fff) term. This means that (O approximates to a scalar 
bandwidth of bikjs/d + 2, that is, the BMP bandwidth (IHi 
with h = Ijs/d + 2. Such an asymptotic behavior has a 
strength and a weakness: one hand, our method inherits rich 
asymptotics from the BMP estimator, including the normality, 
the consistency, the leading term of bias/variance, the rate of 
convergence, etc. Il22ll : on the other hand, our matrix extension 
of bandwidth gives no advantage over the BMP estimator in 
asymptopia. 

For small-to-moderate n, the second term in the r.h.s. of 
( [Tol l can be influential when nonlinear correlation presents. 
Notice that the matrix-valued function (V/ ® ^f^)/P in 
the second term captures the local correlation around Xi, 
which is not taken into account in the BMP estimator. It is 
easy to check that the matrix (V/ 0 V/^)(a;) is always of 
rank one unless the gradient X f{x) vanishes, and its unique 
nonzero eigenvalue equals ||V/(a;)||^ and the corresponding 
eigenvector is parallel to V fix). Dividing iV f f^){x) by 
pix) can be thought of as a relativization of the effect. Thus, 
subtracting such a matrix from the BMP bandwidth as in ( fTOl i 
makes the bandwidth shrink in the gradient direction. Figure 
|2] illustrates how this effect captures the local correlation. 
In this example, the gradient is everywhere normal to the 


Algorithm 1 fc-NN REX Kernel 


Input: 

X = {xx, . . . ,Xn} 

[> Sample (training data) 

fc G { 0,..., n } 

t> Neighborhood size 

mG{l, ...,fc-|-l} 

> KCS size 

; G N 

> Population size 

Output: 

> Synthetic population 

1: for all a: G X do 

2: calculate the fc-NN of x, say Xix,k) 

3: end for 

4: y ^ 0 

5: repeat 

6: draw xi from X at random 

7: draw X2, ■ ■ ■, x^ (xi f Xj) from X(a;i, fc) at random 

8: generate y by (8) 

9: y<-yu{y} 

10: until \Y\ = 1 

11: return Y 



circumference of the circular density and becomes very large 
near the circumference. Thus, the bandwidth shrinkage does 
occur in the normal directions and does not in the tangent 
directions with respect to the density manifold, resulting in a 
better fit shown in Fig. |2] Accordingly, our matrix extension of 
bandwidth will gain some improvement in finite sample size. 

Another advantage of our method over the BMP estimator 
in finite sample size is that the BMP bandwidth depends on 
only the fcth nearest sample point and the other k—1 points are 
ignored while our bandwidth is determined by all the k points. 
This will make the bandwidth computation more stable. 


C. Bagging of Kernels 

Even our A:-NN MLE is stabler than BMP’s fc-NN distance, 
the severe variability of high-dimensional samples will still 
fluctuate the resulting kernel. In order to exploit the sample 
information for further variance reduction, we employ the 
bagging ifT^ of kernels. That is, our KCS is not the k-NN 
itself, but a subset of it, and we change the KCS every time 
a resampling point is generated. The resulting entire estimate 
becomes as follows: 


Hy) = -Y.^xi[K{y\X')] 

71 < ^ * 


( 11 ) 


where Ex' [ ■ ] denotes the expectation over all possible choices 
of KCS X) of size m from the fc-NN of Xi. 

Of course, we cannot compute the value of /(y) since 
Ex'\K{y\X[)\ is a mixture of too many Gaussians, but we 
can easily draw a sample from it by simply choosing Xi and 
its fc-NN points at random and passing them to ®. We end this 
section by presenting the resampling algorithm via the fc-NN 
REX kernel in Algorithm [1] 


V. Experiments 

In order to evaluate the proposed method, we conduct 
numerical experiments on benchmark datasets. The population 
reproducibility, the parameter sensitivity, the bagging effect 
and the CPU time are studied. 












TABLE I. 


Description of datasets. 


Name Dimen.sion Sample Size URL 


PUMS Person in WA 

2 

69301 

m 

Swiss Roll 

3 

10000 

24 

3D Road Network 

4 

434874 

25 

Letter Recognition 

17 

20000 

25 


A. Methods and Parameter Settings 

k-NN REX kernel: Algorithm [T] was used as the proposed 
method. Its parameters were set to be as follows: the neighbor¬ 
hood size k = 0,, 99, and the KCS size m = 1,..., fc + 1 
for each k. 

KM REX kernel: As a state-of-the-art crossover kernel, 
we combined the Kimura-Matsumura (KM) model (|7]i with 
the REX kernel Q. Its parameters were set to be as follows: 
the number of KCSs L = 1,..., 100 and the KCS size m = 
d -f 1,..., d -f 10. The optimization of KCS choice ran until 
consecutive 10000 iterations do not increase the log-likelihood. 

BMP Gaussian kernel: O, (HI with Hi = hfl and 

(01 was used to see our performance progress owing to the 
bandwidth matrization and the bagging. Its parameters were 
set to be as follows: the neighborhood size k = 1,..., 99 and 
the constant multiplier h — 0.01,..., 1.00. 

Fixed Gaussian kernel: ([T| and (|2| with Hi = h'^I was 
used as a conventional estimator^ Its parameter, the bandwidth, 
was set to be h = 0.001,..., 0.100. 

B. Datasets 

To examine the effect of dimensionality and sample size, 
four datasets shown in Table |T] were used. Every dataset was 
affinely transformed by the whitening ll23l so that its mean 
and covariance can be the zero vector and identity matrix, 
respectively. All the datasets and their detailed descriptions are 
available online; consult the references in the URL column of 
the table. 

C. Criterion 

In order to evaluate the accuracy of population reproduc¬ 
tion, we measured the Hellinger distance ll26ll between test 
data and a synthesized population. Each dataset was evenly 
divided into 100 subsetsjj and then used for the inverted cross- 
validation (ICV): one subset was used as training data and the 
remaining 99 subsets constituted test data. Since we cannot 
compute the value of the fc-NN REX kernel’s density dm 
as stated in Section IIV-CI a binned version of the Hellinger 
distance should be used. For a reconstructed population Y and 
a set of test data Z, our performance measure can be written 
by 

H(r, Z) = (vWTm - ^\Zi>\ / |Z|)~ (12) 

V ^ beB 

where B is the set of bins, Y}, (resp. Zb) is a subset of Y 
(resp. Z) whose elements fall in the bin b, and | • | is the 

^This is virtually (but, not exactly) identical to using a fixed matrix 
bandwidth, i.e., Hi = ■ ■ ■ = Hn in m, because our datasets are normalized 
by an affine transformation. 

"^For datasets that cannot be divided by 100, the remainders were randomly 
dropped. 


set cardinality. When H(y, Z) < H(y', Z) holds, the method 
that generates Y has a better accuracy than the method that 
generates Y'. 

For each setting, the 100-fold ICV was done once, and 
then the average and standard deviation over its 100 Hellinger 
distances were calculated. 

D. Results 

Table HD shows the Hellinger distances from the test data 
of 99n points to the population of 99n points synthesized by 
each method with n training data. The Hellinger distances to 
the training data themselves are also displayed as a baseline. 
Note that their values are attainable by simply copying or 
bootstrapping the training data. 

Every method much better reproduced the population than 
the baseline for all datasets, except the KM REX kernel for 3D 
Road Network. In this sense, the kernel density estimation is, 
more or less, generally preferable to copying and bootstrapping 
approaches for the purpose of population synthesis, especially 
when the sample is small and complex. Among others, the k- 
NN REX kernel gained further significant improvements in all 
cases. For visual comparison, we provide the scatter plots of 
synthetic populations in Appendix iBl 

A surprising fact can be found in the comparison of 
their standard deviations. The fc-NN REX kernel’s standard 
deviations were less than the KM REX kernel’s and as low 
as the other three cases. Remember that our model is a 
matrix bandwidth extension of the BMP model and uses a 
lot more kernels than the KM model; thus it seems natural to 
expect our standard deviations have the greatest values. That 
paradoxical result implies the success of our variance reduction 
techniques. It holds good even in Letter Recognition, i.e., a 
high-dimensional (d = 17), small sample size in = 200) case. 

E. Discussions 

1) Parameter Sensitivity: Figure [3 gives a survey of all 
results of the fc-NN REX kernel. The hlled area in each 
plot indicates the parameters whose results surpass the best 
performance of the hxed Gaussian kernel. Every hlled area 
has a horizontally long part, which means the performance of 
the fc-NN REX kernel is insensitive to the neighborhood size 
k. Our observation is consistent with the sensitivity analysis 
for k of the BMP estimator studied by Breiman et al. M- 

The hgure also clarihes that the choice of the KCS size m 
is critical to the performance. How should practitioners hnd 
a suitable m? For PUMS Person in WA {d = 2), the optimal 
value is TO = 3. This agrees with a common practice in genetic 
algorithm: if the search space is of d dimensions, then the 
crossover should use m = d + 1 parents 11211 . For Swiss Roll 
{d = 3), 3D Road Network {d = 4) and Letter Recognition 
(d = 17), the optimal to’s are 3, 2 and 4, respectively. 
It looks inconsistent, but once their intrinsic dimensions are 
considered, the same rule can be found. Swiss Roll is a 2- 
dimensional spiraling band distribution. In 3D Road Network, 
the data are observed along roads, and the intrinsic dimension 
is therefore 1. The dimensionality of Letter Recognition can 
be reduced to about 3 by nonlinear dimensionality reduction 
methods Et). These remind us of a practice in multiobjective 













TABLE II. 


HELLINGER DISTANCE (AVERAGE ± STANDARD DEVIATION OVER 100-EOLD ICV) WITH BEST PARAMETER SETTINGS. IN EACH ROW, THE 
BOLDED AVERAGE IS SMALLER THAN ANY OTHER ONE WITH STATISTICAL SIGNIFICANCEp < 0.01 BY WELCH’S t-TEST. 


Dataset 

(d dimensions, n training points) 

fc-NN 

REX Kernel 

KM 

REX Kernel 

BMP 

Gaussian Kernel 

Fixed 

Gaussian Kernel 

Training Data 
(Baseline) 

PUMS Person in WA 
(d^ 2,n ^ 693) 

2.14e-l ± 6.82e-3 

{k — 66, m = 3) 

2.S7e-l ± 2.06e-2 
{L — 35, m — 4) 

2.46e-l ± 9.74e-3 
{k ^2,h ^ 0.99) 

3.39e-l ± I.09e-2 
{h = 0.011) 

5.11e-l ± 5.73e-3 

Swiss Roll 

(d ^ 3,n ^ 100) 

3.41e-l ± 1.48e-2 

(fc — 12, m — 3) 

3.S9e-l ± 2.63e-2 
{L — 66, m — 6) 

3.74e-l ± 1.32e-2 
{k ^22,h ^ 0.12) 

4.02e-l ± 1.24e-2 
(h ^ 0.056) 

6.65e-l ± 1.34e-2 

3D Road Network 

(d ^ 4, n ^ 4348) 

1.95e-l ± 1.Ole-3 
{k — 30, m = 2) 

3.74e-l ± 4.20e-3 
(L — 96, 771 — 3) 

I.98e-1 ± 3.33e-3 
(k = 3,h = 0.18) 

2.36e-l ± 3.11e-3 
(h = 0.009) 

2.78e-l ± 3.80e-3 

Letter Recognition 
{d ^ 17, n ^ 200) 

5.07e-l ± l.OOe-2 

(k ^ 36, m ^ 4) 

5.64e-l ± 1.42e-2 
(L ^ 68, m ^ 20) 

5.45e-l ± 1.12e-2 
{k ^ 2,h^ 0.12) 

5.47e-l ± 1.08e-2 
{h ^ 0.059) 

7.00e-l ± 1.26e-2 
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Fig. 3. Parameter (fc,m) vs. Hellinger distance of the fc-NN REX kernel. The Hellinger distance is averaged over 100-fold ICV. The legend “iT < 
a (/9/5050) ■” means as follows: Hellinger distances at filled squares are better than the best tuned Gaussian kernel’s Hellinger distance a, and there are /3 
out of 5050 parameter settings that achieve better distances. The legend “Tf > a (/3/5050) □” should be read similarly. 


genetic algorithm: when the Pareto solution set forms a d'- 
dimensional submanifold in the d-dimensional search space, 
using m = d' + 1 parents exhibits a better performance than 
m = d + 1 1281. 

As a consequence, we suggest a parameter selection rule 
that determines k and m by the following steps: 

1) Choose k between 10 and 50. 

2) Identify the intrinsic dimensionality d' of the dataset 
by a priori knowledge or dimensionality reduction. 

3) Set m = d' -f 1, or cross-validate near d' if possible. 


2 ) Effect of Bagging: Our bagging is different from 
Breiman’s original proposal in some points. Given a sample 
of size fc, Breiman’s bagging makes some particular replicates 
of size k by random sampling with replacement. Our bagging 
aims to implicitly achieve similar effect by changing the KCS 
every time, which corresponds to taking all possible replicates 
of size m by random sampling without replacement. 

Despite those differences. Fig. |3] shows that to get better 
results, users should set k much greater than m. If it is the case, 
many possibilities of KCS choice arise and the averaging effect 
of our bagging would be enhanced. Therefore, we can gain 
much variance reduction that will also decrease the Hellinger 






































































































































































































































































distance. 

3) Computational Cost: To analyze the time complexity 
of the A:-NN REX kernel, we examine Algorithm [T] Lines 
1-3 compute the /c-NN of n points of dimension d. We 
implemented this by a naive 0{dn?) time algorithm that 
calculates the Euclidean distances among points. Lines 5-10 
compute I points, each of which is generated by (|8]l in 0{dm) 
time. Therefore, the overall complexity is 0{dii? + dim). In a 
similar way, we also get the complexity of the BMP estimator 
as 0{d'n? + dl). The difference between them is m in the 
second term, which is related to resampling. Usually m is 
small as it should be the intrinsic dimensionality plus one. 
Thus, the additional computation cost for our extensions will 
not be so significant in most cases. 

The actual computation time is shown in Table [HI] This 
was measured by running a single threaded C++ program on a 
server machine: Intel Xeon 3.07 GHz, 64-bit, 30 GB RAM. Eor 
each method, the time includes reading the dataset from a file, 
computing the density, synthesizing the population and writing 
it to a file under one particular parameter setting; excludes 
selecting parameters and calculating the Hellinger distance. 
The table shows that for every dataset, the time of the k- 
NN REX kernel is comparable to the BMP Gaussian kernel. 
This result supports the theoretical comparison of their time 
complexities. Our method is approximately four times faster 
than the KM REX kernel in all cases, which implies that in 
KCS choice, our fc-NN restriction has a certain advantage over 
KM’s log-likelihood optimization in terms of speed. The table 
also shows that the fixed Gaussian kernel is much faster than 
our method. However, this is the result excluding the time of 
bandwidth selection. In practice, our method has a parameter 
selection rule presented in Section IV-Ell which enable us 
to find a good parameter setting within several parameter 
examinations. The other methods need a kind of grid search, 
typically surveying tens or hundreds of parameter settings. 
Eor fairer comparison, if the time of parameter selection is 
taken into consideration, our method would be the fastest. 
The bottleneck of our method, and the BMP estimator, is the 
computation of A:-NN as their complexities imply. In fact, for 
3D Road Network and Letter Recognition, it consumes more 
than 90% of overall time. If one would like to accelerate it, a 
variety of nearest neighbor search techniques are available. 

VI. Application to Urban Micro-Simulation 


Algorithm 2 /c-NN REX Kernel with Bias Correction 


Input: 

X = {xi, . . . ,Xn} 

F = {Ft eN\b e B} 
fc G { 0,..., n } 
mG{l, ...,fc + l} 

I G N 


> PUMS household microdata 
o SF household marginal frequencies 
> Neighborhood size 
> KCS size 
> Number of households 


Output: 

Y = { III, ■ ■ •, 31; } C K'^ > Synthetic households 

1: for all a: G X do 

2: calculate the fc-NN of x, say X{x,k) 

3: end for 
4: y ^ 0 
5: repeat 

6: find the most vacant bin b* = argmax;, Ff, — | Yj | 

7: if Xj. ^ 0 then 

8: draw xi at random from Xj* 

9: else 

10: generate a;i by sampling from the uniform distribution on b* 

11: calculate the fc-NN of a;i 

12: end if 

13: draw X 2 ,. .., Xm (a:; A front X(3:i, k) at random 

14: generate y hy ^ 

15: Y<-YU{y} 

16: if 36 G B : 1X61 > ^6 then 

17: remove a point from y6 at random 

18: end if 

19: until \Y\ = I 

20: return Y 


and Seattle, Washington; and in Europe: Paris, Prance; Brus¬ 
sels, Belgium; and Zurich, Switzerland. UrbanSim requires 
data that describe detailed attributes of every household in the 
study area as a part of the initial condition for simulation. In 
the United States, the U.S. Census Bureau publishes annual 
survey results for a 1% or 5% sample of households living 
in each area as Public Use Microdata Sample (PUMS). The 
marginal distribution of each attribute of households is pub¬ 
lished by Census Summary Piles (SP). Combining them, one 
can generate households to run UrbanSim. 

B. PopGen 

To supply households of UrbanSim, PopGen Il29l is avail¬ 
able. This software implements Iterative Proportional Updating 
(IPU) ITJ that weights each sample point to synthesize the 
population where the weights are determined so that the 
population’s marginal distributions match to the given SP 
marginals. 


In the last section, we have seen that the kernel density 
estimation generally enjoys significant improvements over 
sample-copying or bootstrapping approaches, given an unbi¬ 
ased sample, i.e.. Case I. Now we proceed to Case II, a biased 
sample with marginal frequencies. Our interest in this section 
is whether that is yet beneficial for multi-agent simulations. 

A. UrbanSim 


C. k-NN REX Kernel with Bias Correction 

To apply the fc-NN REX kernel to this task, we need a 
little modification so that its outcome matches to SP marginals. 
Algorithm|2]is the version to do so. In Algorithmic Fi, denotes 
the frequency in the bin b and 1), denotes the subset of Y whose 
elements are in the bin b. Since UrbanSim requires integer 
attributes, we round the attributes of real numbers generated 
by this algorithm. 


UrbanSim HI is one of the most advanced urban micro¬ 
simulators. It simulates location choice of business units and 
households, development and pricing of land and real estate, 
and govermental policies involved in the study area, and can 
investigate their change over years in parcel level. There are 
several applications in U.S. cities: Honolulu, Hawaii; Eugene- 
Springfield, Oregon; Detroit, Michigan; Salt Lake City, Utah; 


D. Experimental Settings 

We used the PUMS 1% sample of households in Seattle 
in 2000 and extracted six attributes from them: age of head, 
annual income, number of people, number of kids, race, 
number of workers. PopGen’s IPU and our method generated 
258,469 households (the number of households in Seattle in 








TABLE III. 


CPU TIME IN SECONDS (AVERAGE ± STANDARD DEVIATION OVER 100-FOLD ICV). BOLD FONTS MEAN AS IN TABLeITII 


Dataset 

(d dimensions, n training points) 

fc-NN KM BMP Fixed 

REX Kernel REX Kernel Gaussian Kernel Gaussian Kernel 

PUMS Person in WA 
(d = 2, n = 693) 

1.67e+0 ± 2.99e-2 6.67e+0 ± 2.13e-l 1.56e+0 ± 3.68e-2 5.19e-l ± 3.29e-3 

(/c = 66,m = 3) (L = 35,m = 4) (k^2,h = 0.99) {h = 0.011) 

Swiss Roll 
(d ^ 3,n = 100) 

3.04e-l ± 4.74e-3 1.26e+0 ± l.Ole-1 2.87e-l ± 4.58e-3 1.56e-l ± 2.76e-3 

(/c=12,m = 4) (L = 66,m = 6) (fc = 22,/i = 0.12) (^ = 0.056) 

3D Road Network 

(d ^ 4, n = 4348) 

2.01e+l ± 3.07e-l 7.41e+l ± 2.47e+0 L96e+1 ± 3.11e-l 3.I6e+0 ± 3.04e-2 

(k = 30,m = 2) (L = 96.m = 5) (fe = 3. ft, = 0.18) (ft = 0.009) 

Letter Recognition 

(d ^ 17, n = 200) 

2.27e+0 ± 3.61e-3 8.39e+0 ± 3.13e-l 2.26e+0 ± 3.99e-2 2.06e-l ± 3.34e-3 

(fc = 36,m = 4) (L = 68,m = 20) (k = 2,h = 0.12) (/i = 0.059) 


TABLE IV. Hellinger distance between synthesized 

HOUSEHOLDS AND PUMS (AVERAGE ± STANDARD DEVIATION OVER 10 
TRIALS). Bold eonts mean as in TableITII 


Year 

fc-NN REX Kernel 

PopGen’s IPU 

2000 (before UrbanSim) 
2012 (after UrbanSim) 

3.94e-l ± 1.99e-2 
4.04e-l ± 2.21e-2 

4.66e-l ± 2.13e-2 
4.79e-l ± 2.52e-2 


2000). With the households generated by each method, we 
ran UrbanSim from 2000 to 2012. We evaluated the Hellinger 
distance between the simulated households and the actual ones 
of PUMS 5% sampl^^. PopGen 1.1 was used as a baseline, 
but note that our settings do not involve any person attribute. 
We used UrbanSim 4.4.0 and kept all settings, other than 
initial households, their default values in the demo project hie 
“seattle_parcel_default.xml”. We used fc = 50 and m = 7 
for our method, according to our parameter selection rule in 
Section IV-Ell 

E. Results and Discussions 

The result is shown in Table IIVI Before simulation, the 
fc-NN REX kernel’s households are closer to the PUMS 5% 
sample than IPU’s. This tendency remains through the simula¬ 
tion, resulting in a better prediction. We guess this is because 
the diversity of household attributes is better reproduced by 
our method. IPU simply copies the given sample to synthesize 
the population, which never includes households that exist in 
a real-world city but are not present in the PUMS 5% sample. 
In contrast, the fc-NN REX kernel estimates the density of the 
underlying population and resamples in a probabilistic manner, 
which can generate households that are not present in the given 
sample. 

VII. Conclusions 

In this paper, we proposed the fc-nearest neighbor crossover 
kernel for population synthesis. We showed that our method 
outperforms, in accuracy and speed, the conventional fixed 
kernel, the Breiman-MeiseTPurcell variable kernel and the 
Kimura-Matsumura crossover kernel. We gave a parameter 
selection rule and a rationale for our method. We demonstrated 
the usefulness of our method for a household synthesis task in 
urban simulation. 

Eor future work, we plan to apply the proposed method 
to other real-world tasks, which will reveal the generality 
of the method. Especially, oversampling for imbalanced data 
would be a fruitful application. Another direction is to further 
interchange knowledge between kernel density estimation and 

^In evaluation, the sampling bias on the PUMS was corrected by using 
PUMS weights. 


genetic algorithm. Theories developed in the former commu¬ 
nity may help to understand why genetic algorithms work. 
Algorithms developed in the latter community can provide 
how to calculate large-scale estimators that are theoretically 
too complex to derive. 

Appendix A 

Derivation of Equation (fTol i 

Loftsgaarden & Quesenberry lf30ll showed that in the con¬ 
sistency analysis of their fc-NN density estimator, if k is chosen 
as a non-decreasing sequence of positive integers such that 
k ^ oo and fc/n —>■ 0 as n —^ c», then the fc-NN ball tends to 
contain arbitrarily many points while its volume converges to 
zero in probability as n increases. Our analysis is motivated by 
their result. However, to simplify the situation, we will evaluate 
the analytic covariance matrix (instead of the sample one) of 
a random variate on a ball whose radius deterministically goes 
to zero. Thus, this is not a rigorous proof but still gives an 
insight into the asymptotic behavior of the sample covariance 
of fc-NN points in our REX kernel as n ^ oo. Assume the 
density / is continuously differentiable and has a non-zero 
value at the point of consideration. 

We consider the first-order Taylor series approximation to 
the density around any point x G to which the fc-NN 
converges. Since the covariance matrix is translation invariant, 
we can fix a; = 0 without loss of generality: 

f{y) = /(o) + ./'(o)y + O(y^). 

where/' = (V/p = (M,...,j^). 

Integration over a d-ball B with radius 6 centered at the 
origin is written by 

P p6 p27T p7T p7T 

/ 0(y)dy = / / / •■ • / 

JB Jo Jo Jo Jo 

.. .,9d-i,r) 

d9i...d9d-idr 

where 

$(01,.. .,9d-i,r) = 

J-1 

(rcos^i,rsin^i cos^ 2 , • ■ • sin^^) cosOj^ 

i^l 

d-l d-l 

..., 'r( sin 9i) cos 0d_i, r sin 9i) 

2=1 i—1 















and 


d-2 

J(0i,.. .,6^-1, r) = 

i=l 


To calculate the integral, there are useful formulas for 
dehnite integrals of trigonometric functions: for any non¬ 
negative integers p, q, 


sinP6lcos« 0de 


sin^ 6 cos'^ Odd 


|B(E+i,a«) 

pB(tti.qi) 


(q even), 

(q odd), 

(p, q even), 
(otherwise) 


where 


B{x,y) 


r(x)r(p) 

T{x + y) 


is the beta function and T is the gamma function which 
satishes: for any non-negative integer r, 


r(r-f 1) 

r(r + i) 


= r!, 

(2r-l)!! ^ 
=--v"^ 

rtr * 


[2r-l]! 

2[2r-l][^_ 


1]! 




with [a;] = max(a;,0). Using these formulas and the property 
that (j){y)dy = 0 for any odd function (j){—y) = —(j){y), 
we get the following equations: 


where 


V 


y 

y 


2 

3 


/ dy = V5\ 

JB 

/ ydy = 0, 

JB 

[ y^dy = 0 
Ib 


—^- (volume of unit ball), 

r(f+ 1) 

y ® y^ (d X d matrix), 

y 0 y^ ® y (d x d x d tensor). 


Let us consider the density restricted to B: 

9{y) = f{y\y &b) = -f{y) 

z 

where 

z = ( /(y)dy 

J B 

~ [ /(o) + ./'(o)ydy 

J B 

= /(O) [ dy + /'(o) [ ydy 

JB JB 

= /(O) • V5<^ + /'(O) • 0 
= /(0)U<5‘'. 


Suppose y is a random variate with density g. The mean 
of y is 


E[y] = / g{y)ydy 
J B 


Z JB 
1 


f{y)ydy 

/(0)y + (/'(0)y) ydy 


= ^ ^f(O) J ydy -f J y^dyV/(0) 
= -(/(0)-0 + -^i-v/(0) 


1 


d + 2 
Vd‘^+^ 


f{0)V6‘i d + 2 
V/ 


V/(0) 


(d + 2) / 


■( 0 ). 


The mean of y^ = y ® y^ is 
E[y^] = f g{y)y‘^dy 

JB 


Z , 

2 Jb 


/(y)y^dy 

/(O)y^ + (/'(0)y)y^dy 


= “ (^/(O) J y^dy+ /'(0) 


y^dy 


= - ( /(O) • 


1 


d + 2 


7 + /'( 0).0 


mv^j. 

/(O)Ud'^ ‘ d + 2 






d + 2' 

Finally, the covariance of y is 
C[y] = E[y2] - E[y]2 
d2 


d + 2 


I- 
I - 




(d + 2) / 

<54 V//' 


d + 2 (d + 2)2 p 


( 0 ). 


This result implies that our fc-NN REX kernel, more gener¬ 
ally, generic A:-NN crossover kernels such that their crossovers 
are compliant to Kita’s preservation of statistics, has the same 
order of bandwidth as the BMP estimator. Especially, if we 
choose the KCS size of the fc-NN REX kernel as m = k+1 and 
the constant multiplier of the BMP estimator as d = 1 /Jd + 2, 
then both asymptotically coincides. Consequently, the fc-NN 
REX kernel can be considered as a matrix bandwidth extension 
of the BMP estimator. 


Appendix B 
Visual Comparison 

Eigure |4] shows the distribution of synthetic populations 
for PUMS Person in WA. These plots intuitively support the 




















correctness of our numerical evaluation presented in Table [III 
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Fig. 4. Synthetic populations, training data and test data for PUMS Person 
in WA. 
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